On the Empirical Complexity of Text Classification Problems
Abstract
In order to train a classifier that generalizes well, different learning problems, in particular high-dimensional ones such as text classification, can require widely different amounts of training, as measured in terms of the number of training instances required to reach adequate accuracy or the number of features effectively utilized in the classifier. We define several measures of learning difficulty and explore their utility in approximately capturing the inherent complexity of text classification problems. These measures can be efficiently computed for real-world problems for which linear classifiers are effective. We observe an intimate relationship (a high positive correlation) between feature complexity and instance complexity when using the measures. Such measures of difficulty are useful for comparing learning problems and corpora, and gaining insight into the variable success of methods such as active learning. In particular, we quantify the difficulty of 358 text classification problems and 9 corpora using the proposed measures, including those popularly used for benchmarking the performance of text classification algorithms, such as the Reuters and 20 Newsgroups corpora. We demonstrate the spectrum of problems that exist in text classification in addition to quantifying results that have only been qualitatively discussed in the text classification literature. We observe that many problems in the commonly used data sets are of low to medium complexity, that is, only roughly tens of well-selected features are required to gain most of the maximum attained performance on such concepts, when using linear classifiers. We find that learning for such types of problems especially stands to benefit from incorporating feature feedback (prior knowledge on features) into active learning techniques.
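As a rough illustration of the "number of well-selected features required to gain most of the maximum attained performance", the sketch below reads a feature count off a performance curve. It is not the paper's own definition of feature complexity: the function name `features_needed`, the 0.9 threshold, and the example scores are all hypothetical, chosen only to make the idea concrete.

```python
from typing import Sequence, Tuple


def features_needed(curve: Sequence[Tuple[int, float]], fraction: float = 0.9) -> int:
    """Return the smallest feature count whose score reaches `fraction`
    of the best score on the curve.

    `curve` holds (number_of_features, score) pairs, e.g. the F1 of a
    linear classifier trained on the top-k ranked features.
    Hypothetical helper, not the measure defined in the paper.
    """
    if not curve:
        raise ValueError("curve must not be empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if any(score < 0.0 for _, score in curve):
        raise ValueError("scores must be non-negative")

    best = max(score for _, score in curve)
    target = fraction * best
    # The best point always meets the target, so this list is never empty.
    return min(k for k, score in curve if score >= target)


# Made-up scores for illustration only: performance saturates quickly,
# so a few features already recover most of the best score.
example_curve = [(1, 0.40), (10, 0.78), (100, 0.84), (1000, 0.85)]
print(features_needed(example_curve, fraction=0.9))
```

Under this reading, a problem where the returned count is in the tens would fall in the "low to medium complexity" range the abstract describes, while a problem that needs thousands of features to approach its best score would be more complex.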
Similar resources
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods for searching, filtering, and managing resources have become essential. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, a main goal in text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
An Improved Flower Pollination Algorithm with AdaBoost Algorithm for Feature Selection in Text Documents Classification
In recent years, production of text documents has seen an exponential growth, which is the reason why their proper classification seems necessary for better access. One of the main problems of classifying text documents is working in high-dimensional feature space. Feature Selection (FS) is one of the ways to reduce the number of text attributes. So, working with a great bulk of the feature spa...
A Method for Keyword Extraction and Word Weighting to Improve Persian Text Classification
Due to the ever-increasing expansion of information and the huge number of unstructured documents, keywords play a very important role in information retrieval. Because manual extraction of keywords faces various challenges, automated extraction seems inevitable. This research attempts to use a thesaurus (a structured word-net) to extract them automatically. A...
An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification
Due to the exponential growth of electronic texts, organizing and managing them requires a tool that delivers the information users search for in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...
Publication date: 2009